Chi-square goodness-of-fit test (χ² GOF)#
The chi-square goodness-of-fit test answers:
“Do these observed category counts look like they came from a specified categorical distribution?”
It’s the right tool when you have:
one categorical variable (k categories)
counts per category (not measurements)
expected probabilities for each category under the null hypothesis
Learning goals#
By the end you should be able to:
set up \(H_0/H_1\) for χ² GOF
compute the χ² statistic, degrees of freedom, and a p-value
interpret the result correctly (what “reject” / “fail to reject” mean)
diagnose which categories drive the result (residuals + contributions)
implement the test from scratch with NumPy
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from math import erf, sqrt
pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
1) What it’s used for (and what it’s not)#
Used for#
You use χ² GOF when you want to compare observed counts to an expected distribution.
Examples:
Is a die fair (uniform probabilities across 6 faces)?
Does a website’s traffic split match a target mix (e.g., 50% mobile, 30% desktop, 20% tablet)?
Do survey responses match a claimed distribution?
Not the same as…#
χ² test of independence: compares two categorical variables via a contingency table.
KS / Anderson–Darling: goodness-of-fit for continuous distributions (without binning).
Conceptually, χ² GOF is a large-sample approximation to a multinomial model under \(H_0\).
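To keep the two chi-square tests apart, here is a minimal SciPy contrast (a sketch; the counts and the contingency table are made up for illustration):

```python
import numpy as np
from scipy.stats import chisquare, chi2_contingency

# Goodness of fit: ONE categorical variable vs a specified distribution.
observed = np.array([18, 22, 20])
stat_gof, p_gof = chisquare(observed)  # default expectation: uniform counts

# Independence: TWO categorical variables in a contingency table.
table = np.array([[30, 10],
                  [20, 40]])
stat_ind, p_ind, df_ind, expected = chi2_contingency(table)

print(round(stat_gof, 3), round(stat_ind, 3), df_ind)
```

Note the different inputs: a single vector of counts for goodness of fit, a two-way table for independence.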
2) Hypotheses#
Suppose there are \(k\) categories with expected probabilities
\[\mathbf{p}_0 = (p_{0,1}, \dots, p_{0,k}), \qquad \sum_{i=1}^{k} p_{0,i} = 1.\]
With observed counts \(\mathbf{O}=(O_1,\dots,O_k)\) and total \(n=\sum_i O_i\):
Null \(H_0\): the data were generated with probabilities \(\mathbf{p}_0\)
Alternative \(H_1\): at least one category probability differs from \(\mathbf{p}_0\)
This is a global test: it detects that something differs, then you inspect residuals to see where.
3) The test statistic (Pearson’s χ²)#
Under \(H_0\), the expected count in category \(i\) is
\[E_i = n\, p_{0,i}.\]
Pearson’s chi-square statistic is
\[\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i}.\]
Why this form?#
\(O_i-E_i\) is the raw discrepancy.
Dividing by \(E_i\) scales it by the typical size of fluctuations (variance grows with \(E_i\)).
Squaring makes positive/negative deviations add up.
A helpful diagnostic is the standardized residual
\[R_i = \frac{O_i - E_i}{\sqrt{E_i}}.\]
So χ² is literally the sum of squared standardized residuals.
# Warm-up: see the pieces on a tiny example
observed_counts = np.array([18, 22, 20])
expected_probs = np.array([1 / 3, 1 / 3, 1 / 3])
n = observed_counts.sum()
expected_counts = n * expected_probs
differences = observed_counts - expected_counts
standardized_residuals = differences / np.sqrt(expected_counts)
contributions = differences**2 / expected_counts
np.column_stack(
[observed_counts, expected_counts, differences, standardized_residuals, contributions]
)
array([[18. , 20. , -2. , -0.4472, 0.2 ],
[22. , 20. , 2. , 0.4472, 0.2 ],
[20. , 20. , 0. , 0. , 0. ]])
The columns above are:
observed \(O_i\)
expected \(E_i\)
difference \(O_i-E_i\)
standardized residual \(R_i=(O_i-E_i)/\sqrt{E_i}\)
contribution \((O_i-E_i)^2/E_i\) (these sum to χ²)
4) Degrees of freedom#
If all \(k\) probabilities are fully specified under \(H_0\) (no parameters estimated from the data), then:
\[\nu = k - 1\]
Why \(k-1\) and not \(k\)?
The probabilities must sum to 1, so only \(k-1\) of the counts can vary freely.
If you estimated \(m\) parameters from the data (e.g., you fit a Poisson rate \(\lambda\) before comparing binned counts), you reduce df:
\[\nu = (k - 1) - m\]
This matters: estimating parameters makes the null more flexible, so the reference χ² distribution has fewer degrees of freedom.
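As a tiny sketch of the bookkeeping (the values of `k` and `m` here are hypothetical):

```python
# Hypothetical case: counts binned into 6 categories, compared against a
# Poisson fit where the rate lambda was estimated from the same data (m = 1).
k = 6             # number of categories
m = 1             # parameters estimated from the data
df = (k - 1) - m  # degrees of freedom for the chi-square reference
print(df)         # 4
```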
5) Assumptions / when the approximation is ok#
χ² GOF relies on an asymptotic (large-sample) approximation.
Common practical checks:
Observations are independent.
Categories are mutually exclusive and collectively exhaustive.
Expected counts are not too small.
Rule of thumb: all \(E_i \ge 5\) (or at least 80% ≥ 5 and none < 1).
If expected counts are small:
combine rare categories, or
use a Monte Carlo / exact multinomial approach.
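One way to sketch the Monte Carlo multinomial approach with NumPy alone (the function name and defaults are illustrative, not a standard API):

```python
import numpy as np

def multinomial_gof_p_value(observed, probs, n_sim=100_000, seed=0):
    """Monte Carlo p-value: simulate counts under H0, compare chi2 statistics.

    This avoids the chi-square approximation, so it remains usable even when
    some expected counts are small.
    """
    observed = np.asarray(observed, dtype=float)
    probs = np.asarray(probs, dtype=float)
    n = int(observed.sum())
    expected = n * probs
    chi2_obs = float(((observed - expected) ** 2 / expected).sum())

    rng = np.random.default_rng(seed)
    sims = rng.multinomial(n, probs, size=n_sim)                # counts under H0
    chi2_sim = ((sims - expected) ** 2 / expected).sum(axis=1)  # statistic per draw
    # Add-one correction keeps the estimated p-value away from exactly zero.
    return (1 + np.count_nonzero(chi2_sim >= chi2_obs)) / (1 + n_sim)

# Example with small expected counts in three of the four categories:
p = multinomial_gof_p_value([3, 2, 1, 14], [0.1, 0.1, 0.1, 0.7])
```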
Also remember:
The test is sensitive to sample size: with large \(n\), tiny deviations can become statistically significant.
6) From-scratch implementation (NumPy-only core)#
We’ll implement:
χ² statistic and per-category contributions
degrees of freedom
p-value via Monte Carlo sampling from a χ² distribution (using only NumPy)
a fast p-value approximation (Wilson–Hilferty)
a simple effect size: Cramér’s V for GOF
We’ll then compare with SciPy as a practical reference.
def _as_1d_nonnegative(name, array_like):
values = np.asarray(array_like, dtype=float)
if values.ndim != 1:
raise ValueError(f"{name} must be a 1D array-like.")
if not np.all(np.isfinite(values)):
raise ValueError(f"{name} must contain only finite values.")
if np.any(values < 0):
raise ValueError(f"{name} must be non-negative.")
return values
def chi_square_gof_statistic(observed_counts, *, expected_probs=None, expected_counts=None):
'''Compute Pearson's χ² GOF statistic and diagnostics.
Provide either expected_probs (summing to 1) or expected_counts.
'''
observed_counts = _as_1d_nonnegative("observed_counts", observed_counts)
total_count = float(observed_counts.sum())
if total_count <= 0:
raise ValueError("observed_counts must sum to > 0.")
if expected_counts is None:
if expected_probs is None:
raise ValueError("Provide expected_probs or expected_counts.")
expected_probs = _as_1d_nonnegative("expected_probs", expected_probs)
if expected_probs.shape != observed_counts.shape:
raise ValueError("expected_probs must have the same shape as observed_counts.")
prob_sum = float(expected_probs.sum())
if not np.isclose(prob_sum, 1.0):
raise ValueError(f"expected_probs must sum to 1. Got {prob_sum}.")
expected_counts = total_count * expected_probs
else:
expected_counts = _as_1d_nonnegative("expected_counts", expected_counts)
if expected_counts.shape != observed_counts.shape:
raise ValueError("expected_counts must have the same shape as observed_counts.")
expected_sum = float(expected_counts.sum())
if expected_sum <= 0:
raise ValueError("expected_counts must sum to > 0.")
# Rescale to match the observed total (helps with rounding or probabilities * n).
expected_counts = expected_counts * (total_count / expected_sum)
if np.any(expected_counts <= 0):
raise ValueError("All expected counts must be > 0 (no zero-probability categories).")
differences = observed_counts - expected_counts
contributions = differences**2 / expected_counts
chi2_stat = float(contributions.sum())
standardized_residuals = differences / np.sqrt(expected_counts)
return {
"chi2": chi2_stat,
"observed": observed_counts,
"expected": expected_counts,
"differences": differences,
"contributions": contributions,
"standardized_residuals": standardized_residuals,
"n": total_count,
"k": int(observed_counts.size),
}
def chi_square_gof_df(k, ddof=0):
df = int(k - 1 - ddof)
if df <= 0:
raise ValueError(
f"Degrees of freedom must be positive. Got df={df} (k={k}, ddof={ddof})."
)
return df
def chi_square_sf_mc(x, df, *, n_sim=200_000, batch_size=50_000, rng=None):
'''Monte Carlo estimate of P(ChiSq_df >= x).
Uses ChiSq_df = sum_{j=1..df} Z_j^2 where Z_j ~ N(0,1).
'''
rng = np.random.default_rng() if rng is None else rng
x = float(x)
if x < 0:
return 1.0, 0.0
n_sim = int(n_sim)
batch_size = int(batch_size)
if n_sim <= 0 or batch_size <= 0:
raise ValueError("n_sim and batch_size must be positive integers.")
count_ge = 0
remaining = n_sim
while remaining > 0:
current = min(batch_size, remaining)
sims = np.square(rng.standard_normal(size=(current, df))).sum(axis=1)
count_ge += int(np.count_nonzero(sims >= x))
remaining -= current
p_value = count_ge / n_sim
se = sqrt(p_value * (1 - p_value) / n_sim)
return p_value, se
def chi_square_sf_wilson_hilferty(x, df):
'''Fast upper-tail approximation using the Wilson–Hilferty transform.'''
x = float(x)
df = float(df)
if x <= 0:
return 1.0
z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / sqrt(2 / (9 * df))
phi = 0.5 * (1.0 + erf(z / sqrt(2.0)))
return 1.0 - phi
def cramers_v_gof(chi2_stat, n, k):
return sqrt(float(chi2_stat) / (float(n) * (k - 1)))
def chi_square_gof_test(
observed_counts,
*,
expected_probs=None,
expected_counts=None,
ddof=0,
alpha=0.05,
n_sim=200_000,
rng=None,
):
details = chi_square_gof_statistic(
observed_counts, expected_probs=expected_probs, expected_counts=expected_counts
)
df = chi_square_gof_df(details["k"], ddof=ddof)
p_mc, p_mc_se = chi_square_sf_mc(details["chi2"], df, n_sim=n_sim, rng=rng)
p_wh = chi_square_sf_wilson_hilferty(details["chi2"], df)
return {
**details,
"df": df,
"alpha": float(alpha),
"p_value_mc": float(p_mc),
"p_value_mc_se": float(p_mc_se),
"p_value_wh": float(p_wh),
"reject_h0_mc": bool(p_mc < alpha),
"cramers_v": float(cramers_v_gof(details["chi2"], details["n"], details["k"])),
}
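As a quick sanity check of the Wilson–Hilferty transform (restated locally so the cell runs on its own), the approximate upper-tail areas at textbook 5% critical values should come out near 0.05:

```python
from math import erf, sqrt

def wh_upper_tail(x, df):
    # Wilson–Hilferty: (X/df)^(1/3) is approximately normal.
    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / sqrt(2 / (9 * df))
    return 1.0 - 0.5 * (1.0 + erf(z / sqrt(2.0)))

# Textbook critical values: P(chi2_df >= crit) should be about 0.05.
for df, crit in [(1, 3.841), (5, 11.070), (10, 18.307)]:
    print(df, round(wh_upper_tail(crit, df), 4))
```

The approximation is weakest at very small df, but it is more than adequate for eyeballing tail areas.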
7) Example: is a die fair?#
We roll a 6-sided die \(n\) times.
\(H_0\): each face has probability \(1/6\)
\(H_1\): at least one face differs
We’ll look at two datasets:
a sample from a fair die (should usually not reject)
a sample from a biased die (should reject)
faces = np.arange(1, 7)
uniform_probs = np.ones(6) / 6
n_rolls = 120
rng_example = np.random.default_rng(1)
observed_fair = rng_example.multinomial(n_rolls, uniform_probs)
observed_biased = rng_example.multinomial(n_rolls, np.array([0.10, 0.10, 0.10, 0.10, 0.10, 0.50]))
rng_mc_fair = np.random.default_rng(100)
rng_mc_biased = np.random.default_rng(101)
fair_result = chi_square_gof_test(observed_fair, expected_probs=uniform_probs, n_sim=200_000, rng=rng_mc_fair)
biased_result = chi_square_gof_test(observed_biased, expected_probs=uniform_probs, n_sim=200_000, rng=rng_mc_biased)
{
"observed_fair": observed_fair,
"chi2_fair": fair_result["chi2"],
"p_mc_fair": fair_result["p_value_mc"],
"reject_fair@0.05": fair_result["reject_h0_mc"],
"observed_biased": observed_biased,
"chi2_biased": biased_result["chi2"],
"p_mc_biased": biased_result["p_value_mc"],
"reject_biased@0.05": biased_result["reject_h0_mc"],
}
{'observed_fair': array([20, 27, 14, 26, 15, 18]),
'chi2_fair': 7.5,
'p_mc_fair': 0.18619,
'reject_fair@0.05': False,
'observed_biased': array([11, 15, 11, 12, 6, 65]),
'chi2_biased': 123.6,
'p_mc_biased': 0.0,
'reject_biased@0.05': True}
from plotly.subplots import make_subplots
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=("Sample from a fair die", "Sample from a biased die"),
)
for col, observed in [(1, observed_fair), (2, observed_biased)]:
expected = observed.sum() * uniform_probs
fig.add_trace(go.Bar(name="Observed", x=faces, y=observed), row=1, col=col)
fig.add_trace(
go.Bar(name="Expected (H0)", x=faces, y=expected, marker_color="rgba(0,0,0,0.25)"),
row=1,
col=col,
)
fig.update_layout(
title="Observed vs expected counts (χ² GOF setup)",
barmode="group",
legend_title_text="Counts",
)
fig.update_xaxes(title_text="Face", row=1, col=1)
fig.update_xaxes(title_text="Face", row=1, col=2)
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=2)
fig.show()
Which categories drive the result?#
Two useful per-category diagnostics:
Standardized residuals: \(R_i=(O_i-E_i)/\sqrt{E_i}\) (signed)
Contributions: \((O_i-E_i)^2/E_i\) (non-negative)
Large absolute residuals / contributions point to categories that disagree most with \(H_0\).
fig = make_subplots(
rows=2,
cols=1,
shared_xaxes=True,
vertical_spacing=0.1,
subplot_titles=("Standardized residuals (signed)", "χ² contributions (sum = χ²)"),
)
fig.add_trace(
go.Bar(x=faces, y=biased_result["standardized_residuals"], name="Residual"),
row=1,
col=1,
)
fig.add_hline(y=0, line_width=1, line_color="black", row=1, col=1)
fig.add_trace(
go.Bar(x=faces, y=biased_result["contributions"], name="Contribution", marker_color="#636EFA"),
row=2,
col=1,
)
fig.update_layout(title="Diagnostics for the biased sample")
fig.update_xaxes(title_text="Face", row=2, col=1)
fig.update_yaxes(title_text="Residual", row=1, col=1)
fig.update_yaxes(title_text="Contribution", row=2, col=1)
fig.show()
Visualizing the p-value (null distribution)#
Under \(H_0\), the χ² statistic follows a χ² distribution (approximately).
A p-value is
\[p = \Pr\!\left(\chi^2_{\nu} \ge \chi^2_{\text{obs}}\right).\]
So it’s a tail probability under the null model.
df = biased_result["df"]
chi2_obs = biased_result["chi2"]
alpha = biased_result["alpha"]
rng_null = np.random.default_rng(123)
null_samples = np.square(rng_null.standard_normal(size=(60_000, df))).sum(axis=1)
critical_value = float(np.quantile(null_samples, 1 - alpha))
fig = go.Figure()
fig.add_histogram(
x=null_samples[null_samples < chi2_obs],
nbinsx=60,
name="Null samples (< observed)",
marker_color="lightgray",
opacity=0.8,
)
fig.add_histogram(
x=null_samples[null_samples >= chi2_obs],
nbinsx=60,
name="Tail samples (≥ observed)",
marker_color="crimson",
opacity=0.9,
)
fig.add_vline(
x=chi2_obs,
line_width=3,
line_color="crimson",
annotation_text=f"Observed χ² = {chi2_obs:.2f}",
annotation_position="top right",
)
fig.add_vline(
x=critical_value,
line_width=2,
line_dash="dash",
line_color="black",
annotation_text=f"Critical (α={alpha:.2f}) ≈ {critical_value:.2f}",
annotation_position="top left",
)
fig.update_layout(
title=f"Null distribution for χ²(df={df}) with observed statistic (biased sample)",
barmode="overlay",
xaxis_title="χ² statistic",
yaxis_title="Frequency",
)
fig.show()
{
"df": df,
"chi2_obs": chi2_obs,
"p_value_mc": biased_result["p_value_mc"],
"p_value_mc_se": biased_result["p_value_mc_se"],
"p_value_wh_approx": biased_result["p_value_wh"],
}
{'df': 5,
'chi2_obs': 123.6,
'p_value_mc': 0.0,
'p_value_mc_se': 0.0,
'p_value_wh_approx': 0.0}
8) How to interpret the result#
Pick a significance level α (often 0.05).
If p-value < α: reject \(H_0\)
The observed counts would be unlikely if the expected distribution were true.
This counts as evidence of a mismatch.
If p-value ≥ α: fail to reject \(H_0\)
The data are reasonably consistent with the expected distribution.
This is not proof that \(H_0\) is true (you may have low power).
What it does not mean#
The p-value is not \(\Pr(H_0 \mid \text{data})\).
A non-significant p-value does not mean “the distributions are identical”.
Effect size (optional but useful)#
For goodness-of-fit, a common effect size is Cramér’s V:
\[V = \sqrt{\frac{\chi^2}{n\,(k-1)}}\]
\(V \approx 0\) means very close to the expected distribution.
For fixed proportional deviations, χ² grows with \(n\), but \(V\) stays roughly stable.
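A minimal sketch of that claim, using idealized (noise-free) counts so the proportional deviation is held exactly fixed while \(n\) grows:

```python
import numpy as np
from math import sqrt

p0 = np.ones(4) / 4
p_true = np.array([0.30, 0.25, 0.25, 0.20])  # same proportional deviation at every n

chi2_values, v_values = [], []
for n in [100, 1_000, 10_000]:
    observed = n * p_true  # idealized counts, no sampling noise
    expected = n * p0
    chi2 = float(((observed - expected) ** 2 / expected).sum())
    v = sqrt(chi2 / (n * (len(p0) - 1)))
    chi2_values.append(chi2)
    v_values.append(v)
    print(n, round(chi2, 2), round(v, 4))
```

The χ² statistic scales linearly with \(n\), while V is unchanged.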
9) Example: known non-uniform expected mix (candy colors)#
Suppose a brand claims the following color mix:
Blue 24%, Orange 20%, Green 16%, Yellow 14%, Red 13%, Brown 13%
We sample a bag and count colors. Are the counts consistent with the claim?
colors = np.array(["Blue", "Orange", "Green", "Yellow", "Red", "Brown"])
expected_probs_claim = np.array([0.24, 0.20, 0.16, 0.14, 0.13, 0.13])
# A sample that is slightly different from the claim
observed_colors = np.array([40, 50, 30, 28, 28, 24])
rng_mc_colors = np.random.default_rng(202)
color_result = chi_square_gof_test(
observed_colors,
expected_probs=expected_probs_claim,
n_sim=250_000,
rng=rng_mc_colors,
)
{
"observed": dict(zip(colors, observed_colors.astype(int))),
"chi2": color_result["chi2"],
"df": color_result["df"],
"p_value_mc": color_result["p_value_mc"],
"reject@0.05": color_result["reject_h0_mc"],
"cramers_v": color_result["cramers_v"],
}
{'observed': {'Blue': 40,
'Orange': 50,
'Green': 30,
'Yellow': 28,
'Red': 28,
'Brown': 24},
'chi2': 4.266025641025641,
'df': 5,
'p_value_mc': 0.511144,
'reject@0.05': False,
'cramers_v': 0.06531481945948898}
expected_colors = color_result["expected"]
fig = go.Figure()
fig.add_bar(name="Observed", x=colors, y=observed_colors)
fig.add_bar(name="Expected (claim)", x=colors, y=expected_colors, marker_color="rgba(0,0,0,0.25)")
fig.update_layout(
title="Candy colors: observed vs expected",
barmode="group",
xaxis_title="Color",
yaxis_title="Count",
)
fig.show()
10) Sample size sensitivity (why χ² can flag tiny deviations)#
If the true proportions differ from \(\mathbf{p}_0\) by some small amount, the χ² statistic tends to grow with \(n\).
We’ll keep the proportional deviation fixed and increase \(n\).
def counts_from_probs(n, probs):
probs = np.asarray(probs, dtype=float)
if not np.isclose(probs.sum(), 1.0):
raise ValueError("probs must sum to 1")
raw = n * probs
counts = np.floor(raw).astype(int)
remainder = int(n - counts.sum())
if remainder > 0:
frac = raw - np.floor(raw)
add_to = np.argsort(frac)[::-1][:remainder]
counts[add_to] += 1
return counts
# Base expectation: uniform over 6 categories
p0 = np.ones(6) / 6
# A small deviation vector that sums to 0 (so p_true still sums to 1)
p_true = p0 + np.array([0.02, -0.01, 0.01, -0.02, 0.00, 0.00])
sample_sizes = np.array([60, 120, 240, 500, 1000, 2000, 5000])
chi2_values = []
p_values = []
from scipy.stats import chi2
for n in sample_sizes:
observed = counts_from_probs(int(n), p_true)
result = chi_square_gof_statistic(observed, expected_probs=p0)
df = chi_square_gof_df(result["k"])
chi2_values.append(result["chi2"])
p_values.append(float(chi2.sf(result["chi2"], df)))
fig = make_subplots(rows=1, cols=2, subplot_titles=("χ² grows with n", "p-value shrinks with n"))
fig.add_trace(go.Scatter(x=sample_sizes, y=chi2_values, mode="lines+markers", name="χ²"), row=1, col=1)
fig.add_trace(
go.Scatter(x=sample_sizes, y=p_values, mode="lines+markers", name="p-value", marker_color="crimson"),
row=1,
col=2,
)
fig.update_xaxes(title_text="n", row=1, col=1)
fig.update_xaxes(title_text="n", row=1, col=2)
fig.update_yaxes(title_text="χ²", row=1, col=1)
fig.update_yaxes(title_text="p-value", type="log", row=1, col=2)
fig.update_layout(title="Same proportional deviation, different sample sizes")
fig.show()
11) Practical check with SciPy#
SciPy has a built-in implementation that matches the formula:
scipy.stats.chisquare(f_obs, f_exp, ddof=...)
It returns the χ² statistic and a p-value computed from the asymptotic χ² distribution (not an exact multinomial p-value).
from scipy.stats import chisquare
chi2_scipy, p_scipy = chisquare(observed_biased, f_exp=(observed_biased.sum() * uniform_probs))
{
"numpy_stat": biased_result["chi2"],
"scipy_stat": float(chi2_scipy),
"scipy_p_value": float(p_scipy),
}
{'numpy_stat': 123.6,
'scipy_stat': 123.6,
'scipy_p_value': 5.419327083852984e-25}
12) Checklist + common pitfalls#
Use counts, not percentages (convert percentages to counts by multiplying by \(n\)).
Make sure expected probabilities sum to 1 and all expected counts are > 0.
Watch out for small expected counts (combine categories or use Monte Carlo / exact).
If you estimated parameters to build \(\mathbf{p}_0\) from the data, adjust df: \(\nu=(k-1)-m\).
A significant result doesn’t tell you which categories differ—use residuals / contributions.
Exercises#
Simulate many fair-die experiments and estimate the false-positive rate at α=0.05.
Create a biased die with a subtle bias (e.g., 0.18 on one face) and see how large \(n\) must be to detect it reliably.
Try binning continuous data into categories and apply χ² GOF, adjusting df if you estimated parameters.
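One possible starting point for the first exercise (a sketch, with arbitrary seed and simulation counts):

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(0)
p0 = np.ones(6) / 6
n_rolls, n_experiments, alpha = 120, 2_000, 0.05

rejections = 0
for _ in range(n_experiments):
    observed = rng.multinomial(n_rolls, p0)  # a fair die, so H0 is true
    _, p = chisquare(observed)               # default expectation is uniform
    rejections += p < alpha

# With H0 true, the rejection rate should land close to alpha.
print(rejections / n_experiments)
```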
References#
Pearson, K. (1900). On the criterion… (original χ² idea)
Any introductory statistics text on chi-square tests